Abstract

It is shown that a simple Dirichlet process mixture of multivariate normals offers 
Bayesian density estimation with adaptive posterior convergence rates. Toward this, a 
novel sieve for non-parametric mixture densities is explored, and its rate adaptability 
to various smoothness classes of densities in arbitrary dimension is demonstrated. This 
sieve construction is expected to offer a substantial technical advancement in studying 
Bayesian non-parametric mixture models based on stick-breaking priors. 
1 Introduction

Asymptotic frequentist properties of Bayesian non-parametric methods have received a lot of attention in recent years. It is now recognized that a single, fully automatic Bayesian model can offer adaptive, optimal rates of convergence for large collections of true data generating distributions, ranging over several smoothness classes. In a seminal work, van der Vaart and van Zanten (2009) establish adaptability of rescaled Gaussian process models for non-parametric regression, classification and density estimation. Rousseau (2010) discusses adaptive density estimation with finite beta mixtures with a hierarchical prior on the number of mixture components. Kruijer et al. (2010) and de Jonge and van Zanten (2010) derive similar results for finite location-scale mixture models, respectively, in density estimation and regression, again with a prior on the number of mixture components.

Quite interestingly, adaptability has not yet been established for Dirichlet process (DP) mixture of normals models for density estimation. Even rates of convergence of these models remain to be derived beyond the univariate case. This is surprising because these models are the most studied of all Bayesian non-parametric models, and were among the first for which positive results about convergence of the posterior were established (Ghosal et al., 1999; Ghosal and van der Vaart, 2001, 2007).



The main challenge in establishing adaptability of DP mixture models and in deriving rates of convergence in higher dimensions lies in constructing a suitable low-entropy, high-mass sieve on the space of non-parametric mixture densities. Such sieve constructions are an integral part of the current technical machinery for deriving rates of convergence. The sieves that have been used to study DP mixture models (e.g., in Ghosal and van der Vaart, 2007) do not scale to higher dimensions and lack adaptability to smoothness classes (Wu and Ghosal, 2010).






The main import of this article is to plug this gap. It is demonstrated that a novel sieve construction proposed by this author (reported earlier in an as yet unpublished paper, Pati et al., 2011) gives the desired dimension-scalability and smoothness-adaptability. This sieve utilizes the well known stick-breaking representation of a DP (Sethuraman, 1994) and can be potentially useful for studying a large class of stick-breaking mixture models beyond the DP mixtures (e.g., Dunson and Park, 2008; Chung and Dunson, 2009; Rodriguez and Dunson, 2011).
This sieve paves the way to the following results. For independent and identically distributed observations $X_1, \ldots, X_n$ from an unknown density $p$ on $\mathbb{R}^d$, posterior convergence rates are derived for a simple DP location mixture model at a true data generating density $p_0$ belonging to either a class of infinitely differentiable densities or a class of compactly supported densities with two continuous derivatives. The derived rates are minimax optimal for these classes (up to logarithmic factors), and adapt to these two classes without requiring any user intervention to select or estimate any tuning parameters.

The two classes considered here form two extremes of the classes of smooth densities. Finer rate adaptability results can be derived by looking at the intermediate classes of Hölder smooth densities. These classes have well defined minimax optimal rates associated with them. It is demonstrated that the new sieve works for all Hölder classes. However, we stop short of deriving precise rates of convergence for these classes. This derivation requires an additional calculation of prior thickness rates for a $p_0$ belonging to these classes, which is a challenging and interesting problem but is tangential to the focus of this article. Interested readers are referred to some recent developments reported in Kruijer et al. (2010).



2 A simple DP location mixture model 



Let $\phi_\sigma$ denote the density of the $d$-variate normal distribution with mean zero and variance $\sigma^2 I$. For any probability measure $F$ on $\mathbb{R}^d$, use $p_{F,\sigma}$ to denote the mixture density

$$p_{F,\sigma}(x) = \int \phi_\sigma(x - z)\,dF(z), \quad x \in \mathbb{R}^d. \tag{1}$$

Assign $p$ a prior distribution $\Pi$ given by the law of the random density $p_{F,\sigma}$ when $(F, \sigma^{-d}) \sim \mathrm{DP}(\alpha) \times \mathrm{Ga}(a, b)$, where $\mathrm{DP}(\alpha)$ denotes the Dirichlet process distribution (Ferguson, 1973) with base measure $\alpha$ and $\mathrm{Ga}(a, b)$ denotes the gamma distribution with shape $a$ and rate $b$.
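As a concrete illustration of (1), the following Python sketch evaluates $p_{F,\sigma}$ when $F$ is a discrete measure with finitely many atoms. The function name `mixture_density` and its arguments are illustrative choices made here, not notation from the paper.

```python
import numpy as np

def mixture_density(x, atoms, weights, sigma):
    """Evaluate p_{F,sigma}(x) = sum_h w_h * phi_sigma(x - z_h) for a discrete F.

    x:       array of shape (n, d), evaluation points
    atoms:   array of shape (H, d), support points z_h of F (hypothetical input)
    weights: array of shape (H,), probabilities w_h summing to one
    sigma:   common scale of the spherical normal kernel
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    atoms = np.atleast_2d(np.asarray(atoms, dtype=float))
    weights = np.asarray(weights, dtype=float)
    d = x.shape[1]
    # Squared Euclidean distances between every point and every atom.
    sq_dist = ((x[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=-1)
    # Density of N(0, sigma^2 I) evaluated at x - z_h.
    kernel = np.exp(-sq_dist / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)
    return kernel @ weights
```

With a single atom at the origin and $\sigma = 1$ in one dimension, the function reduces to the standard normal density.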

It is useful to recall two different characterizations of DP distributions: the original characterization by Ferguson (1973) through a consistent system of Dirichlet distributions over measurable partitions, and the later stick-breaking interpretation due to Sethuraman (1994). The first approach characterizes an $F \sim \mathrm{DP}(\alpha)$, with $\alpha$ a finite measure on $\mathbb{R}^d$, as:

$$(F(B_1), \ldots, F(B_k)) \sim \mathrm{Dir}(\alpha(B_1), \ldots, \alpha(B_k)) \tag{2}$$

for any Borel measurable partition $B_1, \ldots, B_k$ of $\mathbb{R}^d$. The stick-breaking characterization says an

$$F = \sum_{h=1}^{\infty} \pi_h \delta_{Z_h}, \quad \pi_h = V_h \prod_{j<h} (1 - V_j), \quad \delta_x = \text{Dirac measure at } x, \tag{3}$$

has a $\mathrm{DP}(\alpha)$ distribution if $\{V_h, h \ge 1\}$ are independent $\mathrm{Be}(1, |\alpha|)$ random variables with $|\alpha| = \alpha(\mathbb{R}^d)$, $\{Z_h, h \ge 1\}$ are independently distributed according to the probability measure $\bar\alpha = \alpha/|\alpha|$, and these two sets of random variables are mutually independent.
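The stick-breaking construction in (3) can be sketched in a few lines of Python. This is a minimal illustration truncated at a finite level `H`; the names `stick_breaking_weights` and `alpha_mass`, and the choice of a standard normal for $\bar\alpha$, are assumptions made here for the example only.

```python
import numpy as np

def stick_breaking_weights(alpha_mass, H, rng):
    """Draw pi_1, ..., pi_H from (3) and return them with the leftover tail mass.

    alpha_mass plays the role of |alpha|; the tail equals sum_{h > H} pi_h.
    """
    V = rng.beta(1.0, alpha_mass, size=H)            # V_h ~ Be(1, |alpha|)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)))
    weights = V * remaining[:-1]                     # pi_h = V_h * prod_{j<h} (1 - V_j)
    tail = remaining[-1]                             # prod_{h<=H} (1 - V_h)
    return weights, tail

rng = np.random.default_rng(0)
weights, tail = stick_breaking_weights(alpha_mass=2.0, H=50, rng=rng)
# Atoms Z_h drawn from an assumed standard normal base probability measure on R^3.
atoms = rng.standard_normal((50, 3))
```

The leftover `tail` is exactly the quantity that the sieve of Section 4 keeps below a threshold.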

The base measure $\alpha$ gives the mean of $F$, and also determines its support. The only assumptions we make on $\alpha$ are that it admits a Lebesgue density that is strictly positive over the whole of $\mathbb{R}^d$ and that, for some constant $b_1 > 0$, its mass outside the cube $[-a, a]^d$ satisfies $\alpha(([-a, a]^d)^c) \lesssim \exp(-b_1 a^2)$, where $f(a) \lesssim g(a)$ means $f(a) \le K g(a)$ for all $a$, for some fixed constant $K$.



3 Posterior convergence rates and adaptability 

Consider modeling $d$-variate measurements $X_1, X_2, \ldots$ as independent observations from a density $p$, which is assigned a prior distribution $\Pi$. Here $\Pi$ is a probability measure on the space $\mathcal{P}$ of Lebesgue probability densities, equipped with the Borel $\sigma$-field under a metric $\rho$, usually taken to be the $L_1$ metric $\rho(p, q) = \|p - q\|_1 = \int_{\mathbb{R}^d} |p(x) - q(x)|\,dx$ or the Hellinger metric $\rho(p, q) = h(p, q) = [\int_{\mathbb{R}^d} \{p^{1/2}(x) - q^{1/2}(x)\}^2\,dx]^{1/2}$.

Let $\Pi_n(\cdot \mid X_1, \ldots, X_n)$ denote the posterior distribution of $p$ based on the first $n$ measurements, defined for every measurable $B \subset \mathcal{P}$ as

$$\Pi_n(B \mid X_1, \ldots, X_n) = \frac{\int_B \prod_{i=1}^n p(X_i)\,\Pi(dp)}{\int_{\mathcal{P}} \prod_{i=1}^n p(X_i)\,\Pi(dp)}.$$

Let $\{\epsilon_n\}_{n \ge 1}$ be a sequence of positive numbers with $\lim_{n \to \infty} \epsilon_n = 0$. For any $p_0 \in \mathcal{P}$ we say the posterior convergence rate at $p_0$ is (not slower than) $\epsilon_n$ if for some finite constant $M$

$$\lim_{n \to \infty} \Pi_n(\{p : \rho(p_0, p) > M\epsilon_n\} \mid X_1, \ldots, X_n) = 0 \tag{4}$$

almost surely whenever $X_1, X_2, \ldots$ are independent and identically distributed (iid) with density $p_0$.

Although (4) only establishes $\{\epsilon_n\}_{n \ge 1}$ as a bound on the convergence rate, it serves as a useful calibration of the method induced by $\Pi$ for classes of true densities $p_0$ for which optimal estimation rates are known. For example, for various classes of infinitely differentiable densities the optimal rate is known to be $n^{-1/2}(\log n)^k$ for some $k \ge 0$ (Ibragimov and Khas'minskii, 1983), whereas for the class of compactly supported, twice continuously differentiable densities, the optimal rate is known to be $n^{-2/(4+d)}$ (Huang, 2004). A method is considered adaptive if it provides convergence rates that are within a power of $\log n$ of these optimal rates. Along this line, we present the following results.

Theorem 1. Let $\Pi$ be the DP mixture prior of Section 2.

1. If $p_0$ equals $p_{F_0,\sigma_0}$ for some probability measure $F_0$ on $\mathbb{R}^d$ and some $\sigma_0 > 0$, then (4) holds with $\epsilon_n = n^{-1/2}(\log n)^{(d+1+s)/2}$ for every $s > 0$. Such a $p_0$ will be called a super-smooth density.

2. If $p_0$ is compactly supported and twice continuously differentiable then (4) holds with $\epsilon_n = n^{-2/(4+d)}(\log n)^{(4d+2)/(d+4)+s}$ for every $s > 0$. Such a $p_0$ will be called an ordinary-smooth density.
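To get a feel for the two rates in Theorem 1, the sketch below evaluates them numerically. The function names and the sample values of `n`, `d` and `s` are illustrative assumptions, and the rates are only meaningful up to the unspecified constant $M$ in (4).

```python
import math

def supersmooth_rate(n, d, s):
    """epsilon_n = n^{-1/2} (log n)^{(d+1+s)/2} from part 1 of Theorem 1."""
    return n ** -0.5 * math.log(n) ** ((d + 1 + s) / 2)

def ordinary_rate(n, d, s):
    """epsilon_n = n^{-2/(4+d)} (log n)^{(4d+2)/(d+4)+s} from part 2 of Theorem 1."""
    return n ** (-2 / (4 + d)) * math.log(n) ** ((4 * d + 2) / (d + 4) + s)

for n in (1e6, 1e8):
    print(n, supersmooth_rate(n, d=1, s=0.1), ordinary_rate(n, d=1, s=0.1))
```

For a fixed sample size the logarithmic factors can be sizeable, so the comparison is asymptotic in nature.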
These results are proved in Sections 4 and 5. The main tool needed to establish (4) is a set of sufficient conditions proposed in Ghosal et al. (2000, Theorem 2.1). We present here a slightly modified version adapted from Ghosal and van der Vaart (2001, Theorem 2.1). In the following, for any two probability densities $p$ and $q$ and any positive number $\epsilon$, we denote $K(p, q) = \int_{\mathbb{R}^d} p(x) \log\{p(x)/q(x)\}\,dx$, $V(p, q) = \int_{\mathbb{R}^d} p(x)[\log\{p(x)/q(x)\}]^2\,dx$, $B(\epsilon; p) = \{q \in \mathcal{P} : K(p, q) \le \epsilon^2, V(p, q) \le \epsilon^2\}$. For any $Q \subset \mathcal{P}$, its $\epsilon$-covering number $N(\epsilon, Q, \rho)$ is defined to be the minimum number of balls of radius $\epsilon$ (in the metric $\rho$) needed to cover $Q$, with $\log N(\epsilon, Q, \rho)$ referred to as the $\epsilon$-entropy of $Q$.







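Theorem 2. Let $\rho$ be the Hellinger metric on $\mathcal{P}$. Suppose for positive sequences $\bar\epsilon_n, \tilde\epsilon_n$ with $n \min(\bar\epsilon_n^2, \tilde\epsilon_n^2) \to \infty$, there exist positive constants $c_1, c_2, c_3, c_4$ and sets $\mathcal{P}_n \subset \mathcal{P}$, $n \ge 1$, such that for all large $n$

$$\log N(\bar\epsilon_n, \mathcal{P}_n, \rho) \le c_1 n \bar\epsilon_n^2, \tag{5}$$

$$\Pi(\mathcal{P}_n^c) \le c_3 e^{-(c_2 + 4) n \tilde\epsilon_n^2}, \tag{6}$$

$$\Pi(B(\tilde\epsilon_n; p_0)) \ge c_4 e^{-c_2 n \tilde\epsilon_n^2}. \tag{7}$$

Then (4) holds with $\epsilon_n = \max(\bar\epsilon_n, \tilde\epsilon_n)$.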



Remark 1. If (4) holds with $\rho$ = the Hellinger metric then it holds with $\rho$ = the $L_1$ metric, because for any two probability densities $\|p - q\|_1 \le 2h(p, q)$.
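The relation between the two metrics can be checked numerically. The sketch below is an illustration, not part of the original argument: it approximates both distances between two univariate normal densities by a Riemann sum, with the grid limits and helper names chosen here for convenience.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here only as a test case."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def l1_and_hellinger(p, q, dx):
    """Riemann-sum approximations of ||p - q||_1 and h(p, q) on a regular grid."""
    l1 = np.sum(np.abs(p - q)) * dx
    h = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
    return l1, h

x = np.linspace(-12.0, 13.0, 250001)
dx = x[1] - x[0]
l1, h = l1_and_hellinger(normal_pdf(x, 0.0, 1.0), normal_pdf(x, 1.0, 1.0), dx)
# Both inequalities h^2 <= ||p - q||_1 <= 2 h should hold for this pair.
print(h ** 2, l1, 2 * h)
```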

It is common to call the sequence $\{\mathcal{P}_n\}_{n \ge 1}$ a sieve on $\mathcal{P}$. The first two conditions require existence of a low-entropy, high-mass sieve. The third condition requires a quantitative bound on the thickness of the prior $\Pi$ at the true density $p_0$. We first take up the more challenging task of sieve construction for the DP mixture prior of Section 2, followed by prior thickness calculations.



4 Sieve construction 

4.1 The basic construct 



The chief novelty of the sieve proposed in Pati et al. (2011) lies in exploiting the stick-breaking representation of a DP distribution. A high-mass, low-entropy subset of $\mathcal{P}$ can be obtained by considering densities $p_{F,\sigma}$, with $F$ as given in (3) with limited tail mass $\sum_{h>H} \pi_h$. A precise statement is given below.

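Theorem 3. Fix reals $\epsilon, a, \underline{\sigma} > 0$ and integers $M, H \ge 1$. Define

$$Q = \Big\{ p_{F,\sigma} : F = \sum_{h=1}^{\infty} \pi_h \delta_{z_h};\ z_h \in [-a, a]^d,\ h \le H;\ \sum_{h>H} \pi_h < \epsilon;\ 1 \le \frac{\sigma}{\underline{\sigma}} < (1 + \epsilon)^M \Big\}. \tag{8}$$

Then, for some positive constants $b_1, b_2$ and $b_3$,

1. $\log N(\epsilon, Q, \rho) \lesssim dH \log \frac{a}{\underline{\sigma}\epsilon} + H \log \frac{1}{\epsilon} + \log M$, where $\rho$ is either the $L_1$ or the Hellinger metric.

2. If $\Pi$ is the DP mixture prior of Section 2 then $\Pi(Q^c) \lesssim H e^{-b_1 a^2} + e^{-b_2 \underline{\sigma}^{-d}} + \underline{\sigma}^{-b_3 d}(1 + \epsilon)^{-b_3 d M} + \{(e|\alpha|/H) \log(1/\epsilon)\}^H$.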
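Proof. Let $R^*$ be a $(\underline{\sigma}\epsilon)$-net of $[-a, a]^d$ and let $S^*$ be an $\epsilon$-net of the $H$-simplex $S_H = \{p = (p_1, \ldots, p_H) : p_h \ge 0, \sum_h p_h = 1\}$. It is well known that the size of $R^*$ is $\lesssim \{a/(\underline{\sigma}\epsilon)\}^d$ and that of $S^*$ is $\lesssim (1/\epsilon)^H$. For any $p_{F,\sigma} \in Q$, with $F = \sum_{h=1}^{\infty} \pi_h \delta_{z_h}$, find $z_1^*, \ldots, z_H^* \in R^*$, $\pi^* = (\pi_1^*, \ldots, \pi_H^*) \in S^*$ and $m^* \in \{1, \ldots, M\}$ such that

$$\max_{1 \le h \le H} \|z_h - z_h^*\| \le \underline{\sigma}\epsilon, \tag{9}$$

$$\sum_{h=1}^{H} |\tilde\pi_h - \pi_h^*| \le \epsilon, \quad \text{where } \tilde\pi_h = \frac{\pi_h}{1 - \sum_{l>H} \pi_l},\ 1 \le h \le H, \text{ and} \tag{10}$$

$$\sigma^* = \underline{\sigma}(1 + \epsilon)^{m^* - 1} \text{ satisfies } 1 \le \sigma/\sigma^* < 1 + \epsilon. \tag{11}$$

Then, with $F^* = \sum_{h=1}^{H} \pi_h^* \delta_{z_h^*}$, we have,

$$\|p_{F,\sigma} - p_{F^*,\sigma^*}\|_1 \le \|p_{F,\sigma} - p_{F,\sigma^*}\|_1 + \|p_{F,\sigma^*} - p_{F^*,\sigma^*}\|_1$$

$$\lesssim \frac{\sigma - \sigma^*}{\sigma^*} + \sum_{h>H} \pi_h + \sum_{h=1}^{H} \pi_h \|\phi_{\sigma^*}(\cdot - z_h) - \phi_{\sigma^*}(\cdot - z_h^*)\|_1 + \sum_{h=1}^{H} |\pi_h - \pi_h^*|.$$

Each of the first three terms above is smaller than or equal to $\epsilon$. The last term is smaller than or equal to $(1 - \sum_{h>H} \pi_h) \sum_{h=1}^{H} |\tilde\pi_h - \pi_h^*| + \sum_{h>H} \pi_h \sum_{h=1}^{H} \pi_h^* \le 2\epsilon$. Thus a $5\epsilon$-net of $Q$, in the $L_1$ topology, can be constructed with $p^* = p_{F^*,\sigma^*}$ as above. The total number of such $p^*$ is $\lesssim \{a/(\underline{\sigma}\epsilon)\}^{dH} (1/\epsilon)^H M$. This proves the first assertion of the theorem with $\rho = \|\cdot\|_1$; the constant multiplication by 5 can be absorbed in the $\lesssim$ form of the bound. The same obtains for $\rho$ = the Hellinger metric because it is bounded by the square-root of the $L_1$ metric.

Now with $\Pi$ denoting the DP mixture prior of Section 2 we have a stick-breaking representation of a random $p \sim \Pi$ given by $p = p_{F,\sigma} = \sum_{h=1}^{\infty} \pi_h \phi_\sigma(\cdot - Z_h)$ with $\pi_h$ and $Z_h$ as described in (3) and the paragraph that follows, and $\sigma^{-d} \sim \mathrm{Ga}(a, b)$. Therefore,

$$\Pi(Q^c) \le H\bar\alpha(([-a, a]^d)^c) + \Pr(\sigma^2 \notin [\underline{\sigma}^2, \underline{\sigma}^2(1 + \epsilon)^{2M})) + \Pr\Big(\sum_{h>H} \pi_h > \epsilon\Big). \tag{12}$$

The first term is $\lesssim H \exp(-b_1 a^2)$, by assumption on $\alpha$. The second term equals $\Pr(\sigma^{-d} > \underline{\sigma}^{-d}) + \Pr(\sigma^{-d} \le \underline{\sigma}^{-d}(1 + \epsilon)^{-Md}) \lesssim \exp(-b_2 \underline{\sigma}^{-d}) + (\underline{\sigma}^{d}(1 + \epsilon)^{Md})^{-b_3}$ because $\sigma^{-d} \sim \mathrm{Ga}(a, b)$. To bound the last term in (12), note that $\sum_{h>H} \pi_h = \prod_{h=1}^{H}(1 - V_h)$, that $W = -\sum_{h=1}^{H} \log(1 - V_h) \sim \mathrm{Ga}(H, |\alpha|)$, and therefore the last term equals

$$\Pr(W < \log(1/\epsilon)) \le \frac{(|\alpha| \log \frac{1}{\epsilon})^H}{\Gamma(H + 1)} \lesssim \Big\{\frac{e|\alpha|}{H} \log \frac{1}{\epsilon}\Big\}^H$$

by Stirling's formula. This proves the second assertion. □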
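The bound on the stick-breaking tail mass can be sanity-checked by simulation. The following sketch is illustrative only: it estimates $\Pr(\sum_{h>H} \pi_h > \epsilon)$ by Monte Carlo, using the identity $\sum_{h>H} \pi_h = \prod_{h \le H}(1 - V_h)$, and compares it with the two analytic bounds. All function names and the chosen parameter values are assumptions made for this example.

```python
import math
import numpy as np

def tail_mass_exceedance(alpha_mass, H, eps, n_draws=50_000, seed=0):
    """Monte Carlo estimate of Pr(sum_{h>H} pi_h > eps) under stick-breaking."""
    rng = np.random.default_rng(seed)
    V = rng.beta(1.0, alpha_mass, size=(n_draws, H))
    tail = np.prod(1.0 - V, axis=1)      # equals sum_{h>H} pi_h
    return float(np.mean(tail > eps))

def gamma_cdf_bound(alpha_mass, H, eps):
    """(|alpha| log(1/eps))^H / Gamma(H + 1), the bound on Pr(W < log(1/eps))."""
    return (alpha_mass * math.log(1.0 / eps)) ** H / math.gamma(H + 1)

def stirling_bound(alpha_mass, H, eps):
    """{(e |alpha| / H) log(1/eps)}^H, the looser closed form from Theorem 3."""
    return (math.e * alpha_mass * math.log(1.0 / eps) / H) ** H

est = tail_mass_exceedance(1.0, 12, 0.01)
print(est, gamma_cdf_bound(1.0, 12, 0.01), stirling_bound(1.0, 12, 0.01))
```

The closed-form bounds only become small once $H$ is large relative to $|\alpha| \log(1/\epsilon)$, which is why the sieves below let $H$ grow with $n$.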



4.2 Sieves for Theorem 1

The subset $Q$ of Theorem 3 can be easily adapted to form sieves targeted for different rates of convergence. Below we show this for the nearly parametric, super-smooth rate and also for the slower rates associated with Hölder classes of finitely differentiable functions. All this is done for any arbitrary dimension $d \ge 1$.
Proposition 1 (Super-smooth rate). Fix any $s > 0$. For $\tilde\epsilon_n = n^{-1/2}(\log n)^{(d+1)/2}$ and $\bar\epsilon_n = \tilde\epsilon_n (\log n)^{s/2}$, there is a sequence of sets $\mathcal{P}_n$ such that $\log N(\bar\epsilon_n, \mathcal{P}_n, \rho) \lesssim n\bar\epsilon_n^2$ and $\Pi(\mathcal{P}_n^c) \lesssim \exp(-cn\tilde\epsilon_n^2)$ for every $c > 0$, where $\rho$ is either the $L_1$ or the Hellinger metric.

Proof. Let $\mathcal{P}_n$ be defined as $Q$ of (8) with $\epsilon = \bar\epsilon_n = n^{-1/2}(\log n)^{(d+1+s)/2}$, $H = n\bar\epsilon_n^2/\log n = (\log n)^{d+s}$, and $M = a^2 = \underline{\sigma}^{-d} = n$. Then, by Theorem 3,

$$\log N(\bar\epsilon_n, \mathcal{P}_n, \rho) \lesssim d(\log n)^{d+s+1} + (\log n)^{d+s+1} + \log n \lesssim (\log n)^{d+s+1} = n\bar\epsilon_n^2$$

which proves the first assertion. Also,

$$\Pi(\mathcal{P}_n^c) \lesssim (\log n)^{d+s} e^{-b_1 n} + e^{-b_2 n} + n^{b_3} e^{-b_3 d n \log(1 + \bar\epsilon_n)} + (\log n)^{-(d+s-1)(\log n)^{d+s}}. \tag{13}$$

For any $c > 0$, the first, second and fourth terms on the right hand side of (13) are clearly bounded by $C \exp(-c(\log n)^{d+s})$ for some constant $C$. The third term, too, is bounded by the same, possibly with a different $C$, because $n \log(1 + \bar\epsilon_n) > n\bar\epsilon_n^2 = (\log n)^s (\log n)^{d+1} > c(\log n)^{d+s}$. And therefore $\Pi(\mathcal{P}_n^c) \lesssim \exp(-c(\log n)^{d+1})$. This proves the second assertion of the proposition. □
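The parameter choices in this proof can be plugged into the entropy bound of Theorem 3 to see the balance numerically. The sketch below is an illustration under the stated choices $M = a^2 = \underline{\sigma}^{-d} = n$; the function name and the returned dictionary keys are hypothetical, and the constant hidden in $\lesssim$ is ignored.

```python
import math

def supersmooth_sieve(n, d, s):
    """Sieve parameters from the proof of Proposition 1 and the entropy bound of Theorem 3."""
    log_n = math.log(n)
    eps = n ** -0.5 * log_n ** ((d + 1 + s) / 2)   # entropy radius bar-epsilon_n
    H = log_n ** (d + s)                           # number of retained sticks
    a = math.sqrt(n)                               # half-width of the cube for atoms
    sigma_low = n ** (-1.0 / d)                    # lower end of the scale range
    M = n                                          # number of geometric scale steps
    entropy_bound = (d * H * math.log(a / (sigma_low * eps))
                     + H * math.log(1.0 / eps) + math.log(M))
    return {"eps": eps, "H": H, "entropy_bound": entropy_bound, "target": n * eps ** 2}

out = supersmooth_sieve(n=1e8, d=2, s=0.5)
print(out["entropy_bound"] / out["target"])
```

The printed ratio stays bounded as $n$ grows, which is the content of the first assertion.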

Proposition 2 (Hölder-smooth sieve). Fix any $\beta \in (0, 1/2)$, $q \ge 0$ and $s > 0$. For $\tilde\epsilon_n = n^{-\beta}(\log n)^q$, $\bar\epsilon_n = \tilde\epsilon_n (\log n)^s$, there is a sequence of sets $\mathcal{P}_n$ such that $\log N(\bar\epsilon_n, \mathcal{P}_n, \rho) \lesssim n\bar\epsilon_n^2$ and $\Pi(\mathcal{P}_n^c) \lesssim \exp(-cn\tilde\epsilon_n^2)$ for every $c > 0$, where $\rho$ is either the $L_1$ or the Hellinger metric.

Proof. Let $\mathcal{P}_n$ be defined as on the right hand side of (8) with $\epsilon = \bar\epsilon_n = n^{-\beta}(\log n)^{q+s}$, $H = n\bar\epsilon_n^2/\log n = n^{1-2\beta}(\log n)^{2(q+s)-1}$, $M = a^2 = \underline{\sigma}^{-d} = n$. Then by Theorem 3, $\log N(\bar\epsilon_n, \mathcal{P}_n, \rho) \lesssim n^{1-2\beta}(\log n)^{2(q+s)}$ and for every $c > 0$,

$$\Pi(\mathcal{P}_n^c) \lesssim n^{1-2\beta}(\log n)^{2(q+s)-1} e^{-b_1 n} + e^{-b_2 n} + n^{b_3} e^{-b_3 d n \log(1 + \bar\epsilon_n)} + n^{-(1-2\beta) n^{1-2\beta} (\log n)^{2(q+s)-1}}$$

$$\lesssim e^{-(1-2\beta) n^{1-2\beta} (\log n)^{2(q+s)}} \lesssim e^{-c n^{1-2\beta} (\log n)^{2q}}.$$

□

The ordinary-smooth rate corresponds to $\beta = 2/(4 + d)$, and more generally, a Hölder class of functions with continuous derivatives up to order $k$ corresponds to $\beta = k/(2k + d)$.
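The correspondence between smoothness order and polynomial rate exponent is simple arithmetic. As a small illustrative check (the function name is an assumption made here), exact rational arithmetic confirms that $k = 2$ recovers the ordinary-smooth exponent and that every such exponent lies in the range $(0, 1/2)$ required by Proposition 2.

```python
from fractions import Fraction

def holder_rate_exponent(k, d):
    """beta = k / (2k + d): polynomial rate exponent for k continuous derivatives in dimension d."""
    return Fraction(k, 2 * k + d)

for d in (1, 2, 3):
    print(d, [holder_rate_exponent(k, d) for k in (1, 2, 4, 8)])
```

The exponent increases toward 1/2 as $k$ grows and decreases as $d$ grows.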



5 Prior Thickness 

With sieve conditions (5), (6) taken care of, a proof of Theorem 1 requires establishing the prior thickness property (7) of $\Pi$ for each of the two classes of densities. Below we show that for a $p_0$ from either class, $\Pi(B(A\tilde\epsilon_n; p_0)) \ge e^{-cn\tilde\epsilon_n^2}$ for some constants $A > 0$, $c > 0$, with $\tilde\epsilon_n$ as in Proposition 1 or Proposition 2 as appropriate (with $\beta = 2/(4 + d)$). This immediately leads to $\Pi(B(\tilde\epsilon_n; p_0)) \ge e^{-c_2 n\tilde\epsilon_n^2}$ for some finite number $c_2 > 0$ and completes a proof of Theorem 1 with $\epsilon_n = \bar\epsilon_n$, because Propositions 1 and 2 hold for all constants $c > 0$, including $c = c_2 + 4$, as needed by Theorem 2.

We will first tackle prior thickness at ordinary-smooth densities $p_0$, which present a bigger challenge than the super-smooth ones. Our proof closely follows the calculations presented in Ghosal and van der Vaart (2007) with some minor adaptation needed to handle higher dimensions. For this reason, most of the results are presented in the Appendix, with proofs given only for those where some adaptation is needed. However, we present the main argument below, because a similar argument presented in Ghosal and van der Vaart (2007, Section 9) leaves some gaps (pun intended).

Proposition 3 (Ordinary-smooth thickness). Suppose $p_0$ is compactly supported and

$$\int (\|\nabla p_0\|/p_0)^4\, p_0\,d\lambda < \infty, \quad \int (\|\nabla^2 p_0\|_2/p_0)^2\, p_0\,d\lambda < \infty,$$

where $\|A\|_2$ denotes the spectral norm of a matrix $A$. Then $\Pi(B(A\tilde\epsilon_n; p_0)) \ge e^{-cn\tilde\epsilon_n^2}$ with $\tilde\epsilon_n = n^{-2/(4+d)}(\log n)^{(4d+2)/(d+4)}$ for some constants $A > 0$, $c > 0$.

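Proof. Fix a $\sigma^2 \in \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2} \cdot (1/2, 1)$. Find a $b > 1$ such that $\tilde\epsilon_n^{b} \{\log(1/\tilde\epsilon_n)\}^{9/4} \le \tilde\epsilon_n$. Let $P_0$ denote the probability measure associated with the density $p_0$. By Corollary 1 there is a discrete probability measure $F_\sigma = \sum_{j=1}^{N} p_j \delta_{z_j}$ with at most $N \lesssim \sigma^{-d} \{\log(1/\tilde\epsilon_n)\}^d$ support points in $[-a, a]^d$, with at least $\sigma\tilde\epsilon_n^{2b}$ separation between any $z_i \ne z_j$, such that

$$\|p_{P_0,\sigma} - p_{F_\sigma,\sigma}\|_\infty \lesssim \tilde\epsilon_n^{2b}/\sigma^{d} \quad \text{and} \quad \|p_{P_0,\sigma} - p_{F_\sigma,\sigma}\|_1 \lesssim \tilde\epsilon_n^{2b} \{\log(1/\tilde\epsilon_n)\}^{1/2}.$$

Place disjoint balls $U_j$ with centers at $z_j$, $j = 1, \ldots, N$, with diameter $\sigma\tilde\epsilon_n^{2b}$ each. Extend $\{U_1, \ldots, U_N\}$ to a partition $\{U_1, \ldots, U_K\}$ of $[-a, a]^d$ such that each $U_j$, $j = N + 1, \ldots, K$, has diameter smaller than or equal to $\sigma$. This can be done with $K \lesssim \sigma^{-d} \{\log(1/\tilde\epsilon_n)\}^d$. Further extend this to a partition $U_1, \ldots, U_M$ of $\mathbb{R}^d$ such that $(\sigma\tilde\epsilon_n^{2b})^d \lesssim \alpha(U_j) \le 1$ for all $j = 1, \ldots, M$. We can still have $M \lesssim \sigma^{-d} \{\log(1/\tilde\epsilon_n)\}^d \lesssim \tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d}$. Define $p_j = 0$, $j = N + 1, \ldots, M$.

Let $\mathcal{P}_\sigma$ denote the set of probability measures $F$ on $\mathbb{R}^d$ with $\sum_{j=1}^{M} |F(U_j) - p_j| \le 2\tilde\epsilon_n^{2db}$ and $\min_{1 \le j \le M} F(U_j) \ge \tilde\epsilon_n^{4db}/2$. Then, by Lemma 2 (with $V_i = U_i$, $i = 1, \ldots, N$, $V_0 = \cup_{j>N} U_j$), for any $F \in \mathcal{P}_\sigma$, $\|p_{F_\sigma,\sigma} - p_{F,\sigma}\|_\infty \lesssim \tilde\epsilon_n^{2b}/\sigma^{d}$, $\|p_{F_\sigma,\sigma} - p_{F,\sigma}\|_1 \lesssim \tilde\epsilon_n^{2b}$ and hence, by Lemma 4 and Lemma 1,

$$h(p_0, p_{F,\sigma}) \le h(p_0, p_{P_0,\sigma}) + h(p_{P_0,\sigma}, p_{F_\sigma,\sigma}) + h(p_{F_\sigma,\sigma}, p_{F,\sigma})$$

$$\lesssim \sigma^2 + \tilde\epsilon_n^{b} \{\log(1/\tilde\epsilon_n)\}^{1/4} + \tilde\epsilon_n^{b}$$

$$\lesssim \sigma^2 + \tilde\epsilon_n^{b} \{\log(1/\tilde\epsilon_n)\}^{1/4}.$$

Also, for any such $F$, for every $x \in [-a, a]^d$ with $J(x)$ denoting the $j \in \{1, \ldots, K\}$ such that $x \in U_j$,

$$p_{F,\sigma}(x) \ge \int_{\|z - x\| \le \sigma} \phi_\sigma(x - z)\,dF(z) \gtrsim \frac{1}{\sigma^d} \int_{\|x - z\| \le \sigma} dF(z) \ge \frac{1}{\sigma^d} F(U_{J(x)}) \ge \frac{\tilde\epsilon_n^{4db}}{2\sigma^d}$$

because $U_{J(x)}$, with diameter no larger than $\sigma$, must be a subset of the ball of radius $\sigma$ around $x$. So $F \in \mathcal{P}_\sigma$ implies $\log \|p_0/p_{F,\sigma}\|_\infty \lesssim \log(1/\tilde\epsilon_n)$ and therefore, by Lemma 4, $K(p_0, p_{F,\sigma}) \le A^2\tilde\epsilon_n^2$ and $V(p_0, p_{F,\sigma}) \le A^2\tilde\epsilon_n^2$, for a universal constant $A > 0$ that does not depend on $\sigma$.

Note that $M\tilde\epsilon_n^{2db} \lesssim \tilde\epsilon_n^{2db - d/2} \{\log(1/\tilde\epsilon_n)\}^{2d} \le 1$ and, for some large constant $a_1 > 0$, $\tilde\epsilon_n^{2db} \le a_1 \{\min_{1 \le j \le M} \alpha(U_j)\}^{2/3}$. So, by Lemma 3, $\Pr(F \in \mathcal{P}_\sigma) \ge C \exp(-cM \log(1/\tilde\epsilon_n)) \ge C \exp(-c\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1})$, for some constants $C, c$ that depend on $\alpha(\mathbb{R}^d)$, $a$, $d$ and $b$. Therefore,

$$\Pi(B(A\tilde\epsilon_n; p_0)) \ge \exp(-c\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1})\, \Pr(\sigma^2 \in \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2} \cdot (1/2, 1))$$

$$= \exp(-c\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1})\, \Pr(\sigma^{d} \in \tilde\epsilon_n^{d/2} \{\log(1/\tilde\epsilon_n)\}^{-d} \cdot (2^{-d/2}, 1))$$

$$\gtrsim \exp(-c\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1})$$

because $\sigma^{-d}$ has a gamma distribution.

From this the result follows if $\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1} \lesssim n\tilde\epsilon_n^2$. With $\tilde\epsilon_n = n^{-2/(4+d)}(\log n)^q$, we get $n\tilde\epsilon_n^2 = n^{d/(4+d)}(\log n)^{2q}$ and $\tilde\epsilon_n^{-d/2} \{\log(1/\tilde\epsilon_n)\}^{2d+1} \lesssim n^{d/(4+d)}(\log n)^{2d+1-dq/2}$, and hence the condition is satisfied if $2d + 1 - dq/2 \le 2q$, i.e., if $q \ge (4d + 2)/(d + 4)$. □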
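The final step of this proof is an exponent comparison that can be verified exactly. The sketch below, with hypothetical function names, checks with rational arithmetic that $q = (4d+2)/(d+4)$ is the smallest power of $\log n$ for which $2d + 1 - dq/2 \le 2q$.

```python
from fractions import Fraction

def log_power_threshold(d):
    """Smallest q satisfying 2d + 1 - d*q/2 <= 2q, namely (4d + 2)/(d + 4)."""
    return Fraction(4 * d + 2, d + 4)

def thickness_condition_holds(d, q):
    """Check the exponent comparison 2d + 1 - d*q/2 <= 2q from the proof of Proposition 3."""
    q = Fraction(q)
    return 2 * d + 1 - Fraction(d) * q / 2 <= 2 * q

for d in (1, 2, 5):
    q_star = log_power_threshold(d)
    print(d, q_star, thickness_condition_holds(d, q_star))
```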

Prior thickness calculation at a super-smooth $p_0$ follows along the same line, but is simpler because we can bypass the first step in the proof of Proposition 3 of approximating $p_0$ by a $p_{P_0,\sigma}$. In fact, this approximation is the main driver of the slower thickness rate in Proposition 3; the recent developments in Kruijer et al. (2010) are about refining this approximation for densities that have higher order derivatives.

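Proposition 4 (Super-smooth thickness). If $p_0 = p_{F_0,\sigma_0}$ for some $F_0$ supported on $[-a, a]^d$, then $\Pi(B(A\tilde\epsilon_n; p_0)) \ge e^{-cn\tilde\epsilon_n^2}$ with $\tilde\epsilon_n = n^{-1/2}(\log n)^{(d+1)/2}$ for some constants $A, c > 0$.

Proof. Fix a $\sigma \in \sigma_0 \cdot (1 - \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2}, 1)$. Fix $b > 1$ such that $\tilde\epsilon_n^{b} \{\log(1/\tilde\epsilon_n)\}^{9/4} \le \tilde\epsilon_n$. Construct $\mathcal{P}_\sigma$ as before, but with $p_{F_0,\sigma}$ instead of $p_{P_0,\sigma}$. Because $\sigma$ is bounded from below by $\sigma_0/2$, this can be constructed with an $M \lesssim \{\log(1/\tilde\epsilon_n)\}^d$ and hence $\Pr(F \in \mathcal{P}_\sigma) \ge \exp(-c\{\log(1/\tilde\epsilon_n)\}^{d+1})$ for some constant $c$. Note that

$$\|p_0 - p_{F_\sigma,\sigma}\|_1 \le \|p_0 - p_{F_0,\sigma}\|_1 + \|p_{F_0,\sigma} - p_{F_\sigma,\sigma}\|_1 \lesssim \frac{\sigma_0 - \sigma}{\sigma_0} + \tilde\epsilon_n^{2b} \{\log(1/\tilde\epsilon_n)\}^{1/2}$$

$$\le \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2} + \tilde\epsilon_n^{2b} \{\log(1/\tilde\epsilon_n)\}^{1/2}$$

and therefore, $F \in \mathcal{P}_\sigma$ implies $K(p_0, p_{F,\sigma}) \le A^2\tilde\epsilon_n^2$ and $V(p_0, p_{F,\sigma}) \le A^2\tilde\epsilon_n^2$ for some universal constant $A > 0$ that does not depend on $\sigma$. Now, because $\Pr(\sigma \in \sigma_0(1 - \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2}, 1)) \gtrsim \tilde\epsilon_n \{\log(1/\tilde\epsilon_n)\}^{-2} \ge \exp(-\{\log(1/\tilde\epsilon_n)\}^{d+1})$, we have $\Pi(B(A\tilde\epsilon_n; p_0)) \ge \exp(-c\{\log(1/\tilde\epsilon_n)\}^{d+1})$. From this the result follows if $\{\log(1/\tilde\epsilon_n)\}^{d+1} \lesssim n\tilde\epsilon_n^2$, which is satisfied with $\tilde\epsilon_n = n^{-1/2}(\log n)^q$ for $2q \ge d + 1$. □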



A Appendix: Supporting results and proofs 



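Theorem 4. Let $P_0$ be a probability measure on $[-a, a]^d \subset \mathbb{R}^d$. For any $\epsilon > 0$ and $\sigma > 0$, there is a discrete probability measure $F_\sigma$ on $[-a, a]^d$ with at most $N_{\sigma,\epsilon} = D[\{(a/\sigma) \vee 1\} \log(1/\epsilon)]^d$ support points such that $\|p_{P_0,\sigma} - p_{F_\sigma,\sigma}\|_\infty \lesssim \epsilon/\sigma^d$ and $\|p_{P_0,\sigma} - p_{F_\sigma,\sigma}\|_1 \lesssim \epsilon\{\log(1/\epsilon)\}^{1/2}$, for some universal constant $D$.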

Proof. A proof of this result can be obtained through straightforward extensions of Lemma 2 of Ghosal and van der Vaart (2007) and Lemma 3.1 of Ghosal and van der Vaart (2001) to $d$ dimensions. The only subtlety lies in replacing display (3.9) of Ghosal and van der Vaart (2001) with

$$\int z^l\,dF(z) = \int z^l\,dF'(z), \quad l \in \{1, \ldots, 2k - 2\}^d \tag{14}$$

where, for a $z = (z_1, \ldots, z_d) \in \mathbb{R}^d$ and an $l = (l_1, \ldots, l_d) \in \mathbb{Z}^d$, $z^l$ denotes $z_1^{l_1} z_2^{l_2} \cdots z_d^{l_d}$. For any probability distribution $F$ on $\mathbb{R}^d$, there exists a discrete distribution $F'$ with at most $\{2(k - 1)\}^d + 1$ support points, satisfying (14). This power of $d$ propagates all through the required extensions and appears in $N_{\sigma,\epsilon}$ in the statement of the current theorem. □

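Corollary 1. Let $P_0$ be a probability measure on $[-a, a]^d \subset \mathbb{R}^d$. For any $\epsilon > 0$ and $\sigma > 0$, there is a discrete probability measure $F_\sigma^*$ on $[-a, a]^d$ with at most $N_{\sigma,\epsilon} = D[\{(a/\sigma) \vee 1\} \log(1/\epsilon)]^d$ support points from the set $\{(n_1, \ldots, n_d)\sigma\epsilon : n_i \in \mathbb{Z}, |n_i| \le \lceil a/(\sigma\epsilon) \rceil, i = 1, \ldots, d\}$ such that $\|p_{P_0,\sigma} - p_{F_\sigma^*,\sigma}\|_\infty \lesssim \epsilon/\sigma^d$ and $\|p_{P_0,\sigma} - p_{F_\sigma^*,\sigma}\|_1 \lesssim \epsilon\{\log(1/\epsilon)\}^{1/2}$.

Proof. First get $F_\sigma$ as in Theorem 4 and then move each of its support points to the nearest point on the grid $\{(n_1, \ldots, n_d)\sigma\epsilon : n_i \in \mathbb{Z}, |n_i| \le \lceil a/(\sigma\epsilon) \rceil, i = 1, \ldots, d\}$ to get $F_\sigma^*$. These moves cost at most a constant times $\epsilon/\sigma^d$ to the supremum norm distance and at most a constant times $\epsilon$ to the $L_1$ distance. □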
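The grid-snapping step in this proof can be sketched as follows. This is an illustrative implementation under assumptions made here: support points are stored as rows of an array, `spacing` stands for $\sigma\epsilon$, and points that land on the same grid node have their weights merged. The function name `snap_to_grid` is hypothetical.

```python
import numpy as np

def snap_to_grid(points, weights, spacing):
    """Move each support point to the nearest node of the grid spacing * Z^d.

    Returns the distinct grid nodes that are hit and the total weight at each node.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Nearest integer multi-index for every point, coordinate by coordinate.
    idx = np.rint(points / spacing).astype(np.int64)
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)
    # Merge the weights of points that share a grid node.
    merged = np.bincount(inverse, weights=weights, minlength=len(uniq))
    return uniq * spacing, merged

rng = np.random.default_rng(1)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
wts = np.full(200, 1.0 / 200)
grid_pts, grid_wts = snap_to_grid(pts, wts, spacing=0.1)
```

Each point moves by at most `spacing / 2` per coordinate, so distinct grid nodes that remain are separated by at least `spacing`, which is the separation used in the proof of Proposition 3.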

Lemma 1. Let $p_0$ be a twice continuously differentiable probability density on $\mathbb{R}^d$ and let $P_0$ denote the corresponding probability measure. If

$$\int (\|\nabla p_0\|/p_0)^4\, p_0\,d\lambda < \infty \quad \text{and} \quad \int (\|\nabla^2 p_0\|_2/p_0)^2\, p_0\,d\lambda < \infty,$$

where $\|A\|_2$ denotes the spectral norm of a matrix $A$, then $h(p_0, p_{P_0,\sigma}) \lesssim \sigma^2$.



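Proof. The proof below closely follows the proof of Lemma 4 in Ghosal and van der Vaart (2007) with some adaptation needed to handle $d > 1$. By the assumptions on $p_0$, $p_0$ and $\nabla p_0$ are uniformly bounded and hence $p_\sigma(x) := p_{P_0,\sigma}(x) = \int p_0(x - \sigma y)\phi(y)\,dy$ is twice continuously differentiable in $\sigma$ with derivatives $\dot p_\sigma(x)$ and $\ddot p_\sigma(x)$ given by

$$\dot p_\sigma(x) = -\int y'\nabla p_0(x - \sigma y)\phi(y)\,dy,$$

$$\ddot p_\sigma(x) = \int \{y'\nabla^2 p_0(x - \sigma y)y\}\phi(y)\,dy.$$

Using Taylor's theorem with the integral form of the remainder we have

$$p_\sigma^{1/2}(x) = p_0^{1/2}(x) + \sigma\frac{\dot p_0(x)}{2p_0^{1/2}(x)} + \sigma^2 \int_0^1 \Big\{\frac{\ddot p_{s\sigma}(x)}{2p_{s\sigma}^{1/2}(x)} - \frac{\dot p_{s\sigma}^2(x)}{4p_{s\sigma}^{3/2}(x)}\Big\}(1 - s)\,ds.$$

Because $\dot p_0(x) = -\int y'\nabla p_0(x)\phi(y)\,dy = 0$ for every $x$, we obtain

$$h^2(p_0, p_\sigma) \lesssim \sigma^4 \int_0^1 \int \Big\{\Big(\frac{\ddot p_{s\sigma}(x)}{p_{s\sigma}^{1/2}(x)}\Big)^2 + \Big(\frac{\dot p_{s\sigma}^2(x)}{p_{s\sigma}^{3/2}(x)}\Big)^2\Big\}\,dx \times (1 - s)^2\,ds.$$

Now, for any $\sigma$, by the Cauchy-Schwarz inequality,

$$\ddot p_\sigma^2(x) = \Big\{\int y'\nabla^2 p_0(x - \sigma y)y\,\phi(y)\,dy\Big\}^2 \le \int \frac{\{y'\nabla^2 p_0(x - \sigma y)y\}^2}{p_0(x - \sigma y)}\phi(y)\,dy \times p_\sigma(x)$$

$$\le \int \frac{\|y\|^4 \|\nabla^2 p_0(x - \sigma y)\|_2^2}{p_0(x - \sigma y)}\phi(y)\,dy \times p_\sigma(x)$$

and hence $\int (\ddot p_{s\sigma}(x)/p_{s\sigma}^{1/2}(x))^2\,dx \le \int (\|\nabla^2 p_0\|_2/p_0)^2\, p_0\,d\lambda \times \int \|y\|^4\phi(y)\,dy \lesssim 1$.

By Hölder's inequality with $p = 4$ and $q = 4/3$,

$$\dot p_\sigma^4(x) \le \Big\{\int \|y\|\,\|\nabla p_0(x - \sigma y)\|\,\phi(y)\,dy\Big\}^4 \le \Big(\int \frac{\|y\|^4 \|\nabla p_0(x - \sigma y)\|^4}{p_0^3(x - \sigma y)}\phi(y)\,dy\Big) \times p_\sigma^3(x)$$

and hence $\int (\dot p_{s\sigma}^2(x)/p_{s\sigma}^{3/2}(x))^2\,dx \le \int (\|\nabla p_0\|/p_0)^4\, p_0\,d\lambda \times \int \|y\|^4\phi(y)\,dy \lesssim 1$. □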

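Lemma 2. Let $V_0, V_1, \ldots, V_N$ be a partition of $\mathbb{R}^d$ and $F' = \sum_{j=1}^{N} p_j \delta_{z_j}$ a probability measure on $\mathbb{R}^d$ with $z_j \in V_j$, $j = 1, \ldots, N$. Then, for any probability measure $F$ on $\mathbb{R}^d$, and any $\sigma > 0$,

$$\|p_{F,\sigma} - p_{F',\sigma}\|_\infty \lesssim \frac{1}{\sigma^{d+1}} \max_{1 \le j \le N} \mathrm{diam}(V_j) + \frac{1}{\sigma^d} \sum_{j=1}^{N} |F(V_j) - p_j|,$$

$$\|p_{F,\sigma} - p_{F',\sigma}\|_1 \lesssim \frac{1}{\sigma} \max_{1 \le j \le N} \mathrm{diam}(V_j) + \sum_{j=1}^{N} |F(V_j) - p_j|,$$

where $\mathrm{diam}(A) := \sup\{\|z_1 - z_2\| : z_1, z_2 \in A\}$ denotes the diameter of a set $A$.

Proof. See the proof of Lemma 5 of Ghosal and van der Vaart (2007). □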



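Lemma 3 (Lemma 10 of Ghosal and van der Vaart (2007)). Let $(X_1, \ldots, X_N) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_N)$, $0 < \alpha_j \le 1$, $\sum_{j=1}^{N} \alpha_j = m$. Fix $a > 0$, $b > 0$. Then, there exist constants $c$ and $C$ that only depend on $a$, $b$ and $m$ such that for any $\epsilon \in (0, \min(1/4, a\{\min_j \alpha_j\}^b, 1/N))$ and any point $(p_1, \ldots, p_N)$ in the $N$-simplex,

$$\Pr\Big(\sum_{j=1}^{N} |X_j - p_j| \le 2\epsilon,\ \min_j X_j \ge \epsilon^2/2\Big) \ge C \exp\Big(-cN \log \frac{1}{\epsilon}\Big).$$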
Lemma 4. For every pair of probability densities $p$ and $q$,

$$P \log \frac{p}{q} \lesssim h^2(p, q)\Big(1 + \log \Big\|\frac{p}{q}\Big\|_\infty\Big),$$

$$P \Big(\log \frac{p}{q}\Big)^2 \lesssim h^2(p, q)\Big(1 + \log \Big\|\frac{p}{q}\Big\|_\infty\Big)^2,$$

$$\tfrac{1}{4}\|p - q\|_1^2 \le h^2(p, q) \le \|p - q\|_1.$$

Proof. See Lemma 8 of Ghosal and van der Vaart (2007) for the first two inequalities. The last set is well known (e.g., van der Vaart, 1998, page 212). □